Kaggle in Practice: Titanic

Competition: Titanic
Reference: An Interactive Data Science Tutorial
The Titanic survival prediction competition is a binary classification problem: given each passenger's attributes, predict whether they survived the sinking.

First, import the necessary libraries:

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Handle table-like data and matrices
import numpy as np
import pandas as pd

# Modelling Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Modelling Helpers
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
from sklearn.preprocessing import Normalizer, scale
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import RFECV

# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# Configure visualisations
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 8, 6

Next, define some helper functions for plotting:

def plot_histograms(df, variables, n_rows, n_cols):
    fig = plt.figure(figsize=(16, 12))
    for i, var_name in enumerate(variables):
        ax = fig.add_subplot(n_rows, n_cols, i + 1)
        df[var_name].hist(bins=10, ax=ax)
        ax.set_title('Skew: ' + str(round(float(df[var_name].skew()))))
        ax.set_xticklabels([], visible=False)
        ax.set_yticklabels([], visible=False)
    fig.tight_layout()  # improves appearance a bit
    plt.show()

def plot_distribution(df, var, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    facet.map(sns.kdeplot, var, fill=True)  # 'shade=' was renamed to 'fill=' in newer seaborn
    facet.set(xlim=(0, df[var].max()))
    facet.add_legend()

def plot_categories(df, cat, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, row=row, col=col)
    facet.map(sns.barplot, cat, target)
    facet.add_legend()

def plot_correlation_map(df):
    corr = df.corr(numeric_only=True)  # the original referenced the global 'titanic' here; use the argument instead
    _, ax = plt.subplots(figsize=(12, 10))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    _ = sns.heatmap(
        corr,
        cmap=cmap,
        square=True,
        cbar_kws={'shrink': .9},
        ax=ax,
        annot=True,
        annot_kws={'fontsize': 12}
    )

def describe_more(df):
    var, l, t = [], [], []
    for x in df:
        var.append(x)
        l.append(df[x].nunique())  # pd.value_counts() is deprecated; nunique() gives the same count
        t.append(df[x].dtypes)
    levels = pd.DataFrame({'Variable': var, 'Levels': l, 'Datatype': t})
    levels.sort_values(by='Levels', inplace=True)
    return levels

def plot_variable_importance(X, y):
    tree = DecisionTreeClassifier(random_state=99)
    tree.fit(X, y)
    plot_model_var_imp(tree, X, y)

def plot_model_var_imp(model, X, y):
    imp = pd.DataFrame(
        model.feature_importances_,
        columns=['Importance'],
        index=X.columns
    )
    imp = imp.sort_values(['Importance'], ascending=True)
    imp[-10:].plot(kind='barh')  # with the ascending sort, the last ten rows are the most important features
    print(model.score(X, y))

Training and Test Sets

Next, load the training and test sets and concatenate them, so that the later data analysis and feature engineering can operate on both at once:

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
full = pd.concat([train_data, test_data], ignore_index=True)  # DataFrame.append was removed in pandas 2.x
titanic = full[:891].copy()  # full is the combined dataset, titanic the training part; .copy() avoids SettingWithCopyWarning later
print('full:', full.shape, ';titanic:', titanic.shape)

Output:

full: (1309, 12) ;titanic: (891, 12)

Data Analysis

Use full.head() to preview the first few rows of the data, and titanic.info() / test_data.info() to list each column's dtype and non-null count for the training and test sets:
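
full.head()        # preview the first rows of the combined dataset
titanic.info()     # dtypes and non-null counts for the training set
test_data.info()   # the same summary for the test set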

The columns are as follows:
Age: passenger age; a moderate number of values are missing
Cabin: cabin number; missing for most passengers
Embarked: port of embarkation (C, Q, or S); only 2 values missing, both in the training set
Fare: ticket fare; 1 value missing, in the test set
Name: passenger name
Parch: number of parents and children aboard
PassengerId: an auto-incrementing ID with no predictive value
Pclass: ticket class, one of 1, 2, 3
Sex: male or female
SibSp: number of siblings and spouses aboard
Survived: the label; 0 = No, 1 = Yes
Ticket: ticket number
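
These missing-value counts are easy to confirm directly (a quick check, not part of the original post):

full.isnull().sum()  # missing values per column; Survived shows 418 because the test rows are unlabelled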

Plotting a correlation heat map may reveal which variables matter:

plot_correlation_map(titanic)

Output (figure): an annotated correlation heat map of the numeric columns.

Next, plot how individual features relate to survival.
First, Age and Sex versus Survived:

plot_distribution(titanic, var='Age', target='Survived', row='Sex')

Output (figure): KDE curves of Age for each Survived class, one facet row per Sex.

Regions where the two curves differ most are where the feature discriminates best: young males survived noticeably more often, while middle-aged males more often died.

Next, Fare versus Survived:

plot_distribution(titanic, var='Fare', target='Survived')

Output (figure): KDE curves of Fare for each Survived class.

Low fares clearly come with a higher death rate.

Next, Embarked versus Survived:

print(titanic.Embarked.value_counts())
plot_categories(titanic, cat='Embarked', target='Survived')

Output:

S 644
C 168
Q 77
Name: Embarked, dtype: int64

S has by far the most passengers yet the lowest survival rate.
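
The survival rates behind these bar plots can also be read off numerically with a groupby (a quick check, not part of the original post); the same one-liner works for Sex, Pclass, and the other categorical features below:

titanic.groupby('Embarked')['Survived'].mean()  # survival rate per port of embarkation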

Then Sex versus Survived:

print(titanic.Sex.value_counts())
plot_categories(titanic, cat='Sex', target='Survived')

Output:

male 577
female 314
Name: Sex, dtype: int64

Women are far fewer in number but survived at a much higher rate.

And Pclass versus Survived:

print(titanic.Pclass.value_counts())
plot_categories(titanic, cat='Pclass', target='Survived')

Output:

3 491
1 216
2 184
Name: Pclass, dtype: int64

Class 1 has the fewest passengers but the highest survival rate; class 3 has the most passengers and the lowest.

For SibSp and Parch, sum the two and binarize the result into zero versus non-zero:

titanic['Family_All'] = titanic['SibSp'] + titanic['Parch']
titanic['Family_All'] = [0 if i == 0 else 1 for i in titanic.Family_All]
print(titanic.Family_All.value_counts())
plot_categories(titanic, cat='Family_All', target='Survived')

Output:

0 537
1 354
Name: Family_All, dtype: int64

Passengers with a value of 0 (travelling alone) survived at a much lower rate than those with family aboard.

With this visual analysis done, the raw data can now be processed.

First, map Sex from male/female to 1/0:

my_sex = pd.DataFrame()
my_sex['Sex'] = [1 if i == 'male' else 0 for i in full.Sex]

Embarked has only two missing values; fill them with the most frequent port, 'S', then one-hot encode the categories with pd.get_dummies:

my_embarked = pd.DataFrame()
my_embarked['Embarked'] = full.Embarked.fillna('S')
my_embarked = pd.get_dummies(my_embarked.Embarked, prefix = 'Embarked')
my_embarked.head()

Output: the first rows of the indicator columns Embarked_C, Embarked_Q, and Embarked_S.

Pclass has no missing values and only needs one-hot encoding:

my_pclass = pd.DataFrame()
my_pclass = pd.get_dummies(full.Pclass, prefix='Pclass')

Fare has a single missing value, in the test set; fill it with the mean fare, then bin the fares into quartiles with pd.qcut and one-hot encode:

my_fare = pd.DataFrame()
my_fare['Fare'] = full.Fare.fillna(full.Fare.mean())
my_fare['Fare'] = pd.qcut(my_fare['Fare'], 4)
my_fare = pd.get_dummies(my_fare.Fare, prefix='Fare')
my_fare.head()

Age has many missing values. Fill them with random integers drawn from [mean − std, mean + std] of the observed ages, then bin into quartiles and one-hot encode:

my_age = pd.DataFrame()
my_age['Age'] = full.Age
age_avg = full.Age.mean()
age_std = full.Age.std()
age_null_count = full.Age.isnull().sum()
age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
my_age.loc[my_age['Age'].isnull(), 'Age'] = age_null_random_list  # .loc avoids the chained-assignment pitfall
my_age['Age'] = pd.qcut(my_age['Age'], 4)
my_age = pd.get_dummies(my_age.Age, prefix='Age')
my_age.head()
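
Because this imputation is random, the quartile bins (and the final score) can shift between runs; seeding NumPy beforehand makes the result reproducible (a suggestion, not part of the original post):

np.random.seed(0)  # any fixed seed makes the random age imputation repeatable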

A Title feature can be extracted from Name and one-hot encoded:

title = pd.DataFrame()
title['Title'] = full['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
Title_Dictionary = {
"Capt": "Officer",
"Col": "Officer",
"Major": "Officer",
"Jonkheer": "Royalty",
"Don": "Royalty",
"Sir" : "Royalty",
"Dr": "Officer",
"Rev": "Officer",
"the Countess":"Royalty",
"Dona": "Royalty",
"Mme": "Mrs",
"Mlle": "Miss",
"Ms": "Mrs",
"Mr" : "Mr",
"Mrs" : "Mrs",
"Miss" : "Miss",
"Master" : "Master",
"Lady" : "Royalty"
}
title['Title'] = title.Title.map(Title_Dictionary)
title = pd.get_dummies(title.Title)
title.head()

Output: the first rows of the one-hot columns Master, Miss, Mr, Mrs, Officer, and Royalty.
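
One check worth adding (my addition, not in the original post): any raw title absent from Title_Dictionary maps to NaN and is silently dropped by get_dummies, so it pays to verify the dictionary covers every title in the data:

raw_titles = full['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
print(raw_titles[~raw_titles.isin(Title_Dictionary)].unique())  # should print an empty array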

Parch and SibSp are combined into Family_All, just as during the analysis:

my_family = pd.DataFrame()
my_family['Family_All'] = full['Parch'] + full['SibSp']
my_family['Family_All'] = [0 if i == 0 else 1 for i in my_family.Family_All]

Cabin is dropped for now because most of its values are missing, and Ticket is dropped as well.
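
Even a mostly-missing column can carry some signal, though: whether a passenger has a recorded cabin at all correlates with class and survival. A minimal sketch of such a feature (my own suggestion, not used in this post):

my_cabin = pd.DataFrame()
my_cabin['Has_Cabin'] = full.Cabin.notnull().astype(int)  # 1 if a cabin number was recorded, 0 otherwise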

Training the Model

Concatenate the engineered features, then split them back into training and test sets:

full_X = pd.concat([my_family, title, my_age, my_embarked, my_fare, my_pclass, my_sex], axis=1)
train_X = full_X[0:891]
train_y = titanic.Survived
test_X = full_X[891:]
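
A quick shape check confirms the split: 891 labelled training rows and 418 unlabelled test rows, with the same feature columns in both:

print(train_X.shape, train_y.shape, test_X.shape)  # expect 891 training rows and 418 test rows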

Choose a model and evaluate it with 5-fold cross-validation:

from sklearn.model_selection import cross_val_score
model = GradientBoostingClassifier(learning_rate=0.01, max_depth=3, n_estimators=150)
# model = SVC()
# model = RandomForestClassifier(n_estimators=100)
# model = DecisionTreeClassifier()
cross_val_score(model, train_X, train_y, cv=5).mean()

Output:

0.8215596071618176
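
The hyperparameters above were set by hand; a small grid search is the natural next step (a sketch, not part of the original post; the parameter grid is an arbitrary example):

from sklearn.model_selection import GridSearchCV
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 150, 300],
    'max_depth': [2, 3, 4],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
search.fit(train_X, train_y)
print(search.best_params_, search.best_score_)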

Finally, fit the model on the full training set and predict on the test set:

model.fit(train_X, train_y)
test_Y = model.predict(test_X).astype(int)  # Survived became float64 in the merged frame; Kaggle expects integer 0/1
passenger_id = full[891:].PassengerId
test = pd.DataFrame({'PassengerId': passenger_id, 'Survived': test_Y})
test.to_csv('titanic_pred.csv', index=False)

The submission scored 0.78947 on Kaggle, landing in the top 32%.

------------- End of post. Thanks for reading! -------------